storage: sync entries to disk in parallel with followers #19229
Conversation
I'd like to do some testing with this on real clusters, but I guess I'll have to wait until […]
I think […]
It is, at least the initial version: https://github.com/cockroachlabs/roachprod
Great, thanks Marc!
The commit message indicates that we decide whether to do this based on the entries, but the code makes the decision based on the messages (writing to disk asynchronously if all messages are MsgApps). I think this is the better approach, and it could be structured to avoid two calls to sync: instead of "maybe sync, send messages, maybe sync", it could be "maybe send messages, sync, maybe send messages". The latter construction has the nice property that we could easily transform it into "send appends, sync, send other messages". I think that would be safe. Keeping MsgApps in order is an important optimization, but I don't think it matters whether we reorder MsgApps relative to other messages.

The one thing I'm not sure about is whether it's OK to send out a commit index in MsgApps that we have not ourselves written in HardState.Commit (I assume this is the reason for the IsEmptyHardState check in this diff). I think that's safe: the Raft thesis says that the commit index may be treated as volatile state, although etcd/raft keeps it in HardState and I can't remember the details of why. So maybe we keep the IsEmptyHardState check to decide whether we send appends before or after syncing.
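A minimal sketch of that "send appends, sync, send other messages" ordering against etcd/raft's Ready API is shown below. The helper names (`splitMsgApps`, `handleReady`, `send`, `syncToDisk`) are illustrative assumptions, not necessarily the functions used in this PR.

```go
package raftsync

import (
	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// splitMsgApps partitions a Ready's outgoing messages into MsgApps and all
// other message types, preserving relative order within each group.
func splitMsgApps(msgs []raftpb.Message) (msgApps, otherMsgs []raftpb.Message) {
	for _, m := range msgs {
		if m.Type == raftpb.MsgApp {
			msgApps = append(msgApps, m)
		} else {
			otherMsgs = append(otherMsgs, m)
		}
	}
	return msgApps, otherMsgs
}

// handleReady sketches the ordering discussed above: MsgApps are sent before
// the local fsync so that followers write to their disks in parallel with the
// leader, while messages whose safety depends on durable local state (e.g.
// vote responses) are held back until the sync has completed.
func handleReady(rd raft.Ready, send func(raftpb.Message), syncToDisk func(raft.Ready) error) error {
	msgApps, otherMsgs := splitMsgApps(rd.Messages)
	for _, m := range msgApps {
		send(m) // followers can start their disk writes now
	}
	if err := syncToDisk(rd); err != nil { // entries and HardState made durable locally
		return err
	}
	for _, m := range otherMsgs {
		send(m) // sent only once our own state is durable
	}
	return nil
}
```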
How much work would it be to finish this off? I think it'll be a pretty nice win when disks are heavily contended.
Not very much, a day or two of work at most. It's currently grouped into the list of "experiment done, implementation pending" perf work that's on my timeline!
@bdarnell looking back at the history of […]

Outside of that question, I think I've also been able to convince myself that broadcasting a […]

Like you mention, Section 3.8 of the Raft thesis states the following: […]

So unless I'm missing something there, the only thing that would stop us from unconditionally doing the […] approach (regardless of the […])
Thinking about this more, if it isn't necessary to sync the commit index to disk (in both Raft and […]
Polling the group to get the commit index (as described in section 3.8) only works if you know the members of the group, but the members of the group may be changed by committed log entries (described in chapter 4). You need an approximately up-to-date view of the commit index in order to identify the correct nodes to poll. This is mainly an issue when the state machine is transient (or has very infrequent checkpoints) and recovery relies more heavily on replaying the log. This is the reason (IIRC) that the commit index is persisted.

I think that since we persist our applied index (and truncate the raft log) frequently, there is little value to us in storing the commit index. I think that removing the commit index from our persisted HardState and instead setting it to the applied index at startup would be safe (or, more conservatively, persisting HardState.Commit only when we commit a config change).
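As a purely hypothetical sketch of that idea (not code from this PR, and assuming the commit index is no longer persisted on every update): on startup, everything up to the durably persisted applied index is known to be committed, so the in-memory commit index could be re-seeded from it.

```go
package raftsync

import "github.com/coreos/etcd/raft/raftpb"

// recoveredHardState illustrates treating the commit index as volatile state:
// at startup the commit index is bumped to the persisted applied index, and it
// catches back up to its true value via subsequent MsgApps from the leader.
func recoveredHardState(persisted raftpb.HardState, appliedIndex uint64) raftpb.HardState {
	hs := persisted
	if hs.Commit < appliedIndex {
		hs.Commit = appliedIndex
	}
	return hs
}
```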
I experimented with making this change, which allowed us to sync the […]
For etcd, it was just convenient. And we do not really see any performance improvement if we skip commit index sync.
Force-pushed from 8ed8128 to a7f0030
@bdarnell I've updated this to perform the "maybe send messages, sync, maybe send messages" strategy you discussed above. PTAL. I'll follow up with more benchmarks.
pkg/storage/replica.go, line 3690 at r1:

This could use a unittest to make sure we get the boundary conditions right (I'm not sure how common mixed batches are otherwise in our tests).
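For illustration, a boundary-condition test along those lines could look like the sketch below. It exercises the hypothetical splitMsgApps helper from the earlier sketch (empty, all-MsgApp, no-MsgApp, and mixed batches); the actual test added in this PR may differ.

```go
package raftsync

import (
	"reflect"
	"testing"

	"github.com/coreos/etcd/raft/raftpb"
)

// TestSplitMsgApps checks the boundary conditions of the splitMsgApps sketch.
func TestSplitMsgApps(t *testing.T) {
	app := raftpb.Message{Type: raftpb.MsgApp}
	hb := raftpb.Message{Type: raftpb.MsgHeartbeat}

	testCases := []struct {
		msgs, wantApps, wantOthers []raftpb.Message
	}{
		{msgs: nil, wantApps: nil, wantOthers: nil},
		{msgs: []raftpb.Message{app, app}, wantApps: []raftpb.Message{app, app}, wantOthers: nil},
		{msgs: []raftpb.Message{hb, hb}, wantApps: nil, wantOthers: []raftpb.Message{hb, hb}},
		{msgs: []raftpb.Message{hb, app, hb, app}, wantApps: []raftpb.Message{app, app}, wantOthers: []raftpb.Message{hb, hb}},
	}
	for i, tc := range testCases {
		gotApps, gotOthers := splitMsgApps(tc.msgs)
		if !reflect.DeepEqual(gotApps, tc.wantApps) || !reflect.DeepEqual(gotOthers, tc.wantOthers) {
			t.Errorf("%d: splitMsgApps() = (%v, %v), want (%v, %v)",
				i, gotApps, gotOthers, tc.wantApps, tc.wantOthers)
		}
	}
}
```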
Force-pushed from a7f0030 to 323ffa3
pkg/storage/replica.go, line 3690 at r1:

Previously, bdarnell (Ben Darnell) wrote…

Done.
Referenced in cockroachdb#17500.

This change implements the optimization in the Raft thesis under the section: 10.2.1 Writing to the leader’s disk in parallel. The optimization allows the leader to sync new entries to its disk after it has sent the corresponding `MsgApp` messages, instead of before.

Here, we invoke this optimization by:
1. sending all MsgApps.
2. syncing all entries and Raft state to disk.
3. sending all other messages.

Release note (performance improvement): Raft followers now write to their disks in parallel with the leader.
Force-pushed from 323ffa3 to b67eb69
@nvanbenschoten do you have a performance improvement measurement for this optimization?
@xiang90 it's certainly going to depend on your workload and your cluster topology, but I can share the benchmark results I recorded right before merging this.

Workload: […] (see loadgen/kv)

Cluster: […]

Results: Performance Improvement (averaged over three trials before and after the change): Throughput: +23%

Keep in mind that this is a write-only workload with decently large writes (…).
awesome! thank you!
Referenced in #17500.

This change implements the optimization in the Raft thesis under the section: 10.2.1 Writing to the leader’s disk in parallel. The optimization allows the leader to sync new entries to its disk after it has sent the corresponding `MsgApp` messages, instead of before.

Here, we invoke this optimization by:
1. sending all MsgApps.
2. syncing all entries and Raft state to disk.
3. sending all other messages.

The write latency speedup we see from this change is promising. On a write-only workload it's demonstrating anywhere from a 5%-15% speedup for average latencies.
For instance, when running: […]

we get the following histograms (thanks @jordanlewis!):

[Histogram: zoomed on average latency]

[Histogram: zoomed on tail latency]